Abstract


Hierarchical clustering represents one of the most frequently adopted methodologies for identifying clusters in data 
in non-supervised classification tasks. Amongst the advantages of this family of approaches, we have that the possible 
solutions are obtained in a multiscale manner involving a respective dendrogram of the data. In addition to providing 
a more complete description of the interrelationships between the data elements, the number of clusters does not need 
to be specified as in other clustering methods such as k-means, as it can be inferred from the obtained dendrograms. 
There are several possible hierarchical clustering methods, depending on the adopted merging criterion, which can be 
the smallest distance between sets (single linkage), or the minimization of dispersion (Ward’s). The Jaccard index has
also been considered for binary data. In this work, we propose a new family of hierarchical clustering methods, based on
recent developments in which the Jaccard index is generalized to real values as well as on the coincidence index, which
corresponds to the product between this generalized index and the interiority (or homogeneity) index. The former of
these indices is more discriminative of anti-correlations, and the latter also provides a stricter comparison of the
involved clusters. Therefore, it is expected that coincidence index-based hierarchical clustering will be less likely to
yield false positive clusters than other hierarchical approaches. In addition, it becomes possible to start with the
elements to be clustered represented by generic densities or even general scalar fields.


‘The old tree, merged into the ground and sky.’ 
1 Introduction


Hierarchical clustering (e.g. [1, 2, 3]) constitutes one of the
most often employed non-supervised methods as a consequence
of its interesting features. Though there are two main classes
of hierarchical clustering methods, namely divisive and
agglomerative, in the present work we will focus on the latter.

Basically, agglomerative hierarchical clustering methods are
characterized by the successive merging of clusters, defining
a respective dendrogram that provides valuable information
about the interrelationships between the data and subclusters.
In addition, unlike other methods such as k-means, the number
of clusters does not need to be pre-specified. Actually, the
most likely number can often be estimated while taking into
account the structure of the obtained dendrogram. In addition,
agglomerative methods are conceptually and computationally
simple.

Agglomerative hierarchical clustering methods are defined by
the criterion adopted for merging the clusters. For instance,
in the single-linkage approach, the merging proceeds based on
the smallest distance between the existing clusters. In the
extensively used Ward’s method, the merging aims at maintaining
the smallest dispersion of the clusters. Therefore, there is a
virtually infinite number of possible agglomerative methods,
because there is an infinite number of possible merging
criteria.

Though potentially interesting, the use of the Jaccard
similarity index in hierarchical clustering has been mostly
limited to comparing sets (e.g. [4]). However, this index
can be generalized to take into account real data [5, 6, 7],
including respective densities.

In particular, it has been shown [7] that the generalized
Jaccard index relates directly to the prototypical similarity
quantification based on the Kronecker delta function. More
specifically, the Jaccard index can be understood as a more
tolerant version of the absolutely strict Kronecker-based
criterion regarding data similarity. At the same time, the
generalized Jaccard index has also been found [7] to be more
robust than the cosine distance, as it penalizes the existing
anti-correlations more intensely.

Given that the classic Jaccard similarity index does not
take into account the relative interiority between the
compared sets, a new respective generalization has been
proposed that incorporates an additional index for
quantification of the interiority (or homogeneity) of the
pairwise set combinations [5, 6, 7]. The resulting similarity
index therefore provides a stricter quantification of the
similarity between sets, vectors or even real functions.

In the present work, we develop two new types of agglomerative
hierarchical clustering based respectively on the real-valued
Jaccard and coincidence indices.

We start by reviewing the two adopted indices and then
present the respectively obtained agglomerative hierarchical
clustering methodologies. A simple example is also
provided in order to illustrate the proposed approaches.


2 The Jaccard Index for Real Values and the Coincidence Index


The traditional Jaccard index [5] has been extensively
employed as a measurement of the similarity between two
sets A and B, being defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1}$$

This index has been recently generalized to take into
account multisets with possibly negative, real values [5,
6, 7] as:


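$$J(A, B) = \frac{\sum_{i \in S} s_{x_i y_i} \min\{s_{x_i} x_i, \, s_{y_i} y_i\}}{\sum_{i \in S} \max\{s_{x_i} x_i, \, s_{y_i} y_i\}} \tag{2}$$

where the multiplicities of the sets A and B are represented
as $x_i$ and $y_i$, respectively, $s_{x_i} = \mathrm{sign}(x_i)$,
$s_{y_i} = \mathrm{sign}(y_i)$, $s_{x_i y_i} = s_{x_i} s_{y_i}$,
and S is the shared multiset support, i.e. the elements
underlying both multisets (e.g. [6]).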

The Jaccard index has been shown not to be able to take
into account the relative interiority of the two sets [5].
However, we can adopt the following interiority (or
homogeneity) index:

$$I(A, B) = \frac{\sum_{i \in S} \min\{s_{x_i} x_i, \, s_{y_i} y_i\}}{\min\left\{\sum_{i \in S} s_{x_i} x_i, \, \sum_{i \in S} s_{y_i} y_i\right\}} \tag{3}$$


so that the following new index, namely the coincidence
index, can be obtained:

$$C(A, B) = I(A, B) \, J(A, B) \tag{5}$$
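
The three indices of this section can be illustrated with a minimal NumPy sketch. The function names `real_jaccard`, `interiority` and `coincidence` are hypothetical, the multiplicities are assumed to be given as two equal-length vectors over the shared support, and returning 0 when a denominator vanishes is a convention assumed here rather than something stated in the text.

```python
import numpy as np

def real_jaccard(x, y):
    """Real-valued Jaccard index for two multiplicity vectors (sketch)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    s_xy = np.sign(x) * np.sign(y)      # +1 where signs agree, -1 where they differ
    ax, ay = np.abs(x), np.abs(y)       # sign(x_i) * x_i equals |x_i|
    denom = np.maximum(ax, ay).sum()
    if denom == 0:                      # assumed convention for two null vectors
        return 0.0
    return float((s_xy * np.minimum(ax, ay)).sum() / denom)

def interiority(x, y):
    """Interiority (homogeneity) index for two multiplicity vectors (sketch)."""
    ax = np.abs(np.asarray(x, dtype=float))
    ay = np.abs(np.asarray(y, dtype=float))
    denom = min(ax.sum(), ay.sum())
    if denom == 0:                      # assumed convention when one vector is null
        return 0.0
    return float(np.minimum(ax, ay).sum() / denom)

def coincidence(x, y):
    """Coincidence index: product of interiority and real-valued Jaccard."""
    return interiority(x, y) * real_jaccard(x, y)

# Small worked example with one sign disagreement in the last entry.
x = [2.0, 1.0, -1.0]
y = [1.0, 1.0, 1.0]
print(real_jaccard(x, y), interiority(x, y), coincidence(x, y))
```

In this example the sign disagreement in the last entry subtracts from the Jaccard numerator, which gives (1 + 1 - 1) / (2 + 1 + 1) = 0.25. The interiority is 3 / min(4, 3) = 1 because |y| never exceeds |x|, so the coincidence is also 0.25.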


3 Jaccard and Coincidence-Based Hierarchical Clustering


The real-valued Jaccard index, as well as the coincidence
index, can be adopted in order to obtain two respective
new types of agglomerative hierarchical clustering.


The basic idea is to proceed with the cluster merging so
that the two current clusters presenting the largest
similarity, as quantified by the real-valued Jaccard or
coincidence indices, are merged at each step.

Because the two similarity indices compare densities,
kernel expansion of the current clusters is required at
each step. Though in the current work we adopt circularly
symmetric Gaussian kernels, other generic choices can be
adopted to suit specific requirements. In particular, it
becomes possible to start with the individual elements
representing not only isolated points, but whole densities
of generic types and shapes.
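
As an illustration of the kernel expansion step, the sketch below places one circularly symmetric Gaussian on each two-dimensional data point and samples it on a regular grid. The function name `gaussian_expansion`, the grid limits and the value of `sigma` are all assumptions made for this example. Each density is normalized to unit sum over the grid, which is used here as a discrete stand-in for unit area.

```python
import numpy as np

def gaussian_expansion(points, grid_x, grid_y, sigma=0.5):
    """Return one unit-mass Gaussian density per point, sampled on a grid.

    points: iterable of (x, y) pairs; sigma: kernel standard deviation
    (an assumed, user-chosen parameter).
    """
    gx, gy = np.meshgrid(grid_x, grid_y, indexing="xy")
    densities = []
    for px, py in np.asarray(points, dtype=float):
        d2 = (gx - px) ** 2 + (gy - py) ** 2        # squared distance to the point
        kernel = np.exp(-d2 / (2.0 * sigma ** 2))   # circularly symmetric Gaussian
        densities.append(kernel / kernel.sum())     # normalize to unit discrete mass
    return np.stack(densities)

grid = np.linspace(-3.0, 3.0, 61)
dens = gaussian_expansion([(0.0, 0.0), (1.0, 0.5)], grid, grid)
print(dens.shape)
```

Replacing the Gaussian by another kernel, or passing pre-computed densities directly, only changes how the rows of `dens` are produced. The similarity indices operate on the sampled densities regardless of their shape.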

So, the real-valued Jaccard and coincidence agglomerative
clustering methods can be summarized as:


Input data elements c1, c2, ..., cN;
m = 1;
While (m < N):
    Perform kernel expansion of the current clusters;
    Calculate the indices between these clusters;
    Join the two clusters that are most similar;
    Save list of obtained clusters;
    m = m + 1;


Observe that the densities of the merged clusters can be
immediately obtained by summing the respectively involved
densities, without the need for additional kernel expansions.
The densities are assumed to be normalized in the sense of
having unit area.
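
The procedure above can be sketched in Python as follows, using the coincidence index as the merging criterion. This is an illustrative sketch under stated assumptions, not a reference implementation: the names `coincidence` and `coincidence_clustering`, the parameters `sigma`, `grid_size` and `margin`, and the sample points are all hypothetical. Merged densities are obtained by summing the two involved densities, and renormalizing the sum to unit mass is an interpretation of the normalization remark above.

```python
import numpy as np

def coincidence(x, y):
    """Coincidence index (interiority times real-valued Jaccard) of two arrays."""
    ax, ay = np.abs(x), np.abs(y)
    mins = np.minimum(ax, ay)
    jac_den = np.maximum(ax, ay).sum()
    int_den = min(ax.sum(), ay.sum())
    if jac_den == 0 or int_den == 0:    # assumed convention for null densities
        return 0.0
    jaccard = (np.sign(x) * np.sign(y) * mins).sum() / jac_den
    return float(mins.sum() / int_den * jaccard)

def coincidence_clustering(points, sigma=1.0, grid_size=41, margin=3.0):
    """Agglomerative clustering of 2D points; returns the list of merges."""
    pts = np.asarray(points, dtype=float)
    lo = pts.min(axis=0) - margin * sigma
    hi = pts.max(axis=0) + margin * sigma
    gx, gy = np.meshgrid(np.linspace(lo[0], hi[0], grid_size),
                         np.linspace(lo[1], hi[1], grid_size))
    # Initial kernel expansion: one unit-mass Gaussian density per point.
    clusters = {}
    for i, (px, py) in enumerate(pts):
        k = np.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2.0 * sigma ** 2))
        clusters[(i,)] = (k / k.sum()).ravel()
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best, pair = -np.inf, None
        # Find the pair of current clusters with the largest coincidence.
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                s = coincidence(clusters[keys[a]], clusters[keys[b]])
                if s > best:
                    best, pair = s, (keys[a], keys[b])
        ka, kb = pair
        merged = clusters.pop(ka) + clusters.pop(kb)   # sum of the two densities
        merged /= merged.sum()                         # assumed renormalization
        clusters[tuple(sorted(ka + kb))] = merged
        merges.append((ka, kb, best))
    return merges

points = [(0.0, 0.0), (0.3, 0.0), (5.0, 5.0), (6.0, 5.0)]
merges = coincidence_clustering(points)
for a, b, s in merges:
    print(a, b, round(s, 3))
```

Each entry of `merges` records the two clusters joined at that step together with their coincidence value, which is the information needed to draw a dendrogram. Swapping `coincidence` for the real-valued Jaccard index alone would give the other proposed variant.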


4 Case Example 


As a simple case example, consider the distribution of 
individuals represented in terms of two respective features 
as depicted in Figure 1. 


The successive mergings obtained by the coincidence-based
hierarchical clustering are as follows:
Figure 1: Scatterplot for the case example.
5 Concluding Remarks


Agglomerative hierarchical clustering has been extensively
used in a large number of scientific and technological
areas (e.g. [1, 8, 2, 3]). Several of the most traditional
types of these methods have relied on distances between the
clusters, while Ward’s approach consists of merging clusters
so as to ensure minimal dispersion. Though potentially
interesting, the Jaccard index has been mostly constrained
to agglomerative clustering applications involving
quantifications of similarity between sets of objects.

Two new types of agglomerative hierarchical clustering
methods have been proposed, respectively based on the
real-valued Jaccard and coincidence indices [5, 7]. These
two generalizations of the classic Jaccard similarity index
have been proposed recently to cope with negative real
data values, while the latter index also takes into account
the relative interiority between the involved sets [5, 6, 7].
These two generalizations have relied on data representation
as multisets (e.g. [9, 10, 11, 12, 13, 14]), understanding
the multiplicity to encompass real values, including
possibly negative quantities.

Several interesting features are provided by the proposed
methodology for hierarchical clustering. First, the original
data elements can be not only individual observations, but
respective discrete (or even continuous) generic density
distributions, which can have any type of shape. Second, it
becomes possible to compare not only non-negative densities,
but any scalar fields associated with the data elements,
including negative values. Third, the Jaccard and coincidence
indices have been shown to be more robust for cluster
comparison because they impose a higher penalty on
anti-correlations between the involved distributions. The
coincidence index, in particular, takes into account the
relative interiority of the densities, therefore implementing
a stricter comparison between the densities. The latter
feature is of particular potential relevance, because
clustering methods have been shown to present a tendency to
false positive identification of clusters [3].

Several further studies are motivated by the methodology
proposed in this work, including systematic comparisons
between several types of hierarchical (as well as other
types of) clustering approaches, the consideration of
different types of kernels possibly reflecting specific
application requirements, as well as the evaluation of the
methodologies for higher dimensional feature spaces.